Source: Dunzo Blog

A Deep Dive into Dunzo Data Platform | Part-1

Dunzo’s growth has been immense over the last few years. We run hundreds of microservices and therefore handle a wide variety of data, data sources, and components.

Why did we build the platform?

Earlier we operated in a services mode: teams would ask us to onboard data, and we would write the pipelines for them. As scale grows, a services mode runs into problems, so we decided to move to a platform approach and offer teams a self-serve platform instead of services.

The Dunzo Data Platform (DDP) provides various capabilities to handle this large volume of data. We monitor different metrics to provide well-defined and scalable capabilities to our Engineering, Analytics, Product, and Finance teams so that they can make the right decisions at the right time. The core metrics we track are Data Correctness, Data Freshness, Data Completeness, Data Security, CapEx, and OpEx. Overall, DDP offers the following:

- Data Streams
- Data Ingestion using the Self-Serve Platform
- Dimensional Modelling using DBT
- Data Security
- Data Discoverability
- Data Observability
- Data Consumption Capabilities and Tools

In this Part-1 we will cover the high-level architecture of Data Ingestion and Data Streams. Dunzo runs entirely on the Google Cloud Platform; we use various GCP offerings along with some open-source systems wherever needed. We ingest data from multiple data sources in both real-time and batch mode.

[Figure: High-Level Ingestion Flow]

Real-Time Data Streams

The internal data sources we ingest in real time are the following.

Postgres
Most Dunzo microservices use PostgreSQL as their data source. Postgres provides WAL (Write-Ahead Logs), using which we replicate the data to consumers and the data warehouse. Debezium Server is used for this purpose.

MongoDB
Many Dunzo microservices also use MongoDB as a data source. Mongo provides the OpLog (operations log), using which we replicate the data to consumers and the data warehouse. Debezium Server is used for this purpose.

Spanner
Some Dunzo microservices use Spanner as a NewSQL database. Spanner provides Change Streams, using which we replicate the data to consumers and the data warehouse. Apache Beam’s SpannerIO is used for this purpose.

Debezium Server & Google Cloud Pub/Sub
We use Debezium Server to read the WAL (in the case of Postgres) and the OpLog (in the case of Mongo) and send the change events to a sink. The sink we use is Google Pub/Sub, a publish-subscribe messaging platform.

Extraction pipeline using Google Cloud Dataflow
We use Apache Beam pipelines to extract data from Pub/Sub and convert it from the Debezium payload to the Dunzo internal payload; the Beam pipelines run on Google Cloud Dataflow. We also do some enrichment to include the full update in the payload. The pipeline then publishes to a different Pub/Sub topic (Data Streams). A minimal sketch of this step appears at the end of this section.

[Figure: Real-Time Data Streams Architecture]

In the case of Spanner, the extraction pipeline reads directly from Change Streams using SpannerIO and then publishes to a different Pub/Sub topic (Data Streams).

Real-Time Data Ingestion

We use the sink created by the Real-Time Data Streams as a data source (the Data Streams) and then transform and load the data into the data warehouse.
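The extraction step described above follows a simple read-transform-publish pattern. Below is a minimal sketch of such a Beam pipeline, not our actual code: the project id, topic names, and the mapping from the Debezium envelope to the internal payload are hypothetical placeholders, and the exact envelope shape depends on the Debezium converter settings.

```python
# Minimal sketch (placeholder names): read Debezium change events from Pub/Sub,
# map them to a flat internal payload, and publish to a "Data Streams" topic.
import json

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
from apache_beam.options.pipeline_options import PipelineOptions


def debezium_to_internal(message: bytes) -> bytes:
    """Map a Debezium envelope to a hypothetical flat internal payload."""
    event = json.loads(message.decode("utf-8"))
    # With the JSON converter, the change event may be wrapped in a "payload" field.
    payload = event.get("payload", event)
    internal = {
        "op": payload.get("op"),                               # c / u / d
        "source_table": payload.get("source", {}).get("table"),
        "ts_ms": payload.get("ts_ms"),
        "before": payload.get("before"),
        "after": payload.get("after"),                         # row image after the change
    }
    return json.dumps(internal).encode("utf-8")


def run():
    options = PipelineOptions(
        runner="DataflowRunner",      # executed on Google Cloud Dataflow
        project="my-gcp-project",     # placeholder project id
        region="asia-south1",
        streaming=True,
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadCDC" >> ReadFromPubSub(topic="projects/my-gcp-project/topics/cdc-orders")
            | "ToInternalPayload" >> beam.Map(debezium_to_internal)
            | "PublishDataStream" >> WriteToPubSub(topic="projects/my-gcp-project/topics/datastream-orders")
        )


if __name__ == "__main__":
    run()
```

The transform and load pipelines described next follow the same shape, with Pub/Sub or BigQuery swapped in as the source and sink.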
We also ingest data from the real-time streams exposed by various microservices in Dunzo.

Transform pipeline using Google Cloud Dataflow
We use Apache Beam pipelines to extract data from Pub/Sub and convert the Dunzo internal payload, applying data type changes and renamings as per the schema defined for the table in the Schema Registry (schemas are in Avro format).

Schema Registry
We use the Schema Registry to store the schema for every table; it is used both for transformation and for validation of incoming data.

Load pipeline using Google Cloud Dataflow
We use Apache Beam pipelines to extract data from the transformed-data Pub/Sub topic and convert it to the BigQuery (data warehouse) record payload for streaming inserts into BigQuery.

Data Warehouse: BigQuery
We use BigQuery for storing all our big data. All the data is ingested into different BigQuery tables.

[Figure: Near Real Time Data Transformation and Load to Data Warehouse]

In the case of tables, one Pub/Sub topic and one BigQuery table get created per table. In the case of events, one BigQuery table gets created per event_type.

[Figure: Custom Microservice Events Ingestion Architecture]

Batch Data Ingestion

The external data sources we ingest in batch mode are the following.

API Pollers
We pull various data from third parties through their APIs using scheduled pollers and push it to our data warehouse. We use Airflow for scheduling and orchestrating these DAGs (a minimal DAG sketch appears at the end of this article).

Data on Object Store Bucket (GCS)
In some cases, third parties deliver data into our GCS (Google Cloud Storage) buckets; we process it and push it to our data warehouse. We use Airflow for scheduling and orchestrating these DAGs as well.

All the batch pipelines are orchestrated by Airflow; we use Astronomer as managed Airflow.

Note: the real-time system mentioned here is near real-time and may have a lag of a few milliseconds.

Conclusion

To manage and orchestrate all of this, we have built a self-serve UI tool so that engineering teams in Dunzo can quickly onboard these data sources to Data Streams or the data warehouse. We will cover the low-level details and the challenges we faced in upcoming parts of this series. Stay tuned.

#dataengineering #dunzo #engineering
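As a closing illustration of the batch path referenced in the Batch Data Ingestion section, here is a minimal, hypothetical Airflow DAG sketch: a scheduled task polls a third-party API and stages the response in GCS, and a load task pushes the staged files into BigQuery. The DAG id, bucket, dataset, table, and schedule are placeholders, not our actual configuration.

```python
# Minimal sketch (placeholder names): poll a vendor API on a schedule,
# stage the response in GCS, then load the staged files into BigQuery.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)


def poll_vendor_api(**context):
    """Placeholder: call the third-party API and write the response to GCS."""
    # e.g. fetch with requests and upload via google-cloud-storage; omitted here.
    pass


with DAG(
    dag_id="vendor_api_to_bigquery",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",   # poll the API on a fixed schedule
    catchup=False,
) as dag:
    poll = PythonOperator(
        task_id="poll_vendor_api",
        python_callable=poll_vendor_api,
    )

    load = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket="my-landing-bucket",                       # placeholder GCS bucket
        source_objects=["vendor/{{ ds }}/*.json"],        # files staged by the poller
        destination_project_dataset_table="my_project.raw.vendor_data",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
    )

    poll >> load
```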
